Bayesian Information Criterion for Censored Survival Models

Authors

  • Chris T. Volinsky
  • Adrian E. Raftery
Abstract

We investigate the Bayesian Information Criterion (BIC) for variable selection in models for censored survival data. Kass and Wasserman (1995) showed that BIC provides a close approximation to the Bayes factor when a unit-information prior on the parameter space is used. We propose a revision of the penalty term in BIC so that it is defined in terms of the number of uncensored events instead of the number of observations. For the simplest censored data model, that of exponential distributions of survival times (i.e. a constant hazard rate), this revision results in a better approximation to the exact Bayes factor based on a conjugate unit-information prior. In the Cox proportional hazards regression model, we propose defining BIC in terms of the maximized partial likelihood. Using the number of deaths rather than the number of individuals in the BIC penalty term corresponds to a more realistic prior on the parameter space, and is shown to improve predictive performance for assessing stroke risk in the Cardiovascular Health Study.

Key words: Bayes factor; Cox proportional hazards model; Exponential distribution; Partial likelihood; Variable selection.

Contents

1 Introduction
2 The Bayesian Information Criterion
3 Model Selection in Censored Survival Models
  3.1 Model Selection
  3.2 Exponential Survival
  3.3 The Cox Proportional Hazards Model
4 Quantitative Assessment of the Unit Information Priors
5 Example
  5.1 The Cardiovascular Health Study
  5.2 Bayesian Model Averaging
6 Discussion

List of Tables

1 Comparison of BIC Priors
2 n vs. d: Comparison of CHS Results

List of Figures

1 Standardized Coefficients from the Three Studies
2 n vs. d: Partial Predictive Scores from 100 Splits

1 Introduction

The Bayesian framework for hypothesis testing uses Bayes factors to quantify the evidence for one hypothesized model against another (Kass and Raftery 1995). Schwarz (1978) derived the Bayesian Information Criterion (or BIC) as a large sample approximation to twice the logarithm of the Bayes factor. For a model $M_j$ parameterized by an $m_j$-dimensional vector $\theta_j$,

$$\mathrm{BIC} = -2\{\ell_j(\hat\theta_j) - \ell_0(\hat\theta_0)\} + (m_j - m_0)\log(n), \qquad (1)$$

where $\ell_j(\hat\theta_j)$ and $\ell_0(\hat\theta_0)$ are the maximized log-likelihoods under $M_j$ and a reference model $M_0$, whose parameter has dimension $m_0$, and $n$ is the sample size. If $M_0$ is nested within $M_j$, $2[\ell_j(\hat\theta_j) - \ell_0(\hat\theta_0)]$ is the standard likelihood ratio test (LRT) statistic for testing $M_0$ against $M_j$, and $(m_j - m_0)$ is the number of degrees of freedom associated with that test. Thus, if the models are nested, BIC is equal to a complexity penalty, which depends on the degrees of freedom of the test, minus the standard LRT statistic. If BIC < 0, $M_j$ is favored over $M_0$, and the more negative BIC is, the more $M_j$ is favored.

BIC provides an approximation to the Bayes factor which can readily be computed from the output of standard statistical software packages. It has been widely used as a statistical model selection criterion; see, e.g., Kass and Raftery (1995), Raftery (1995), and references therein.

The derivation of BIC involves a Laplace approximation to the Bayes factor, and ignores terms of constant order, including those from the prior, which are dominated by terms from the likelihood when the sample is large enough. Asymptotically, BIC favors the "correct" model with a probability that tends to 1 as sample size increases, but the difference between BIC and twice the log Bayes factor does not vanish asymptotically in general, although it becomes inconsequential in large samples.
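As a minimal sketch of equation (1), the function below computes BIC for a model relative to a reference model from their maximized log-likelihoods. It assumes the sign convention in which negative values favor $M_j$; the function name, argument names, and numerical inputs are hypothetical and chosen only for illustration.

```python
import math


def bic(loglik_model, loglik_ref, m_model, m_ref, n):
    """BIC of model M_j relative to reference model M_0, as in equation (1).

    loglik_model, loglik_ref: maximized log-likelihoods (hypothetical inputs).
    m_model, m_ref: parameter dimensions m_j and m_0.
    n: the "sample size" used in the penalty term.
    """
    lrt = 2.0 * (loglik_model - loglik_ref)      # likelihood ratio statistic
    penalty = (m_model - m_ref) * math.log(n)    # complexity penalty
    return -lrt + penalty


# Illustrative numbers only: two extra parameters, LRT statistic of 20.
value = bic(loglik_model=-480.0, loglik_ref=-490.0, m_model=3, m_ref=1, n=1000)
print(value < 0)  # a negative value favors M_j over M_0
```

Keeping the LRT statistic and the penalty as separate terms makes it easy to see how the choice of n affects only the penalty.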
However, Kass and Wasserman (1995) show that under certain non-restrictive regularity conditions, the difference between BIC and twice the log Bayes factor does tend to zero for a specific choice of prior on the parameters. They argue that this implicit prior is a reasonable one.

Kass and Wasserman (1995) note that the "sample size" n which appears in the penalty term of (1) must be carefully chosen. Raftery (1995) discusses the use of BIC in several standard statistical models and notes that the choice of n is often not obvious. For censored survival models such as the proportional hazards model of Cox (1972), subjects contribute widely varying amounts of information to the likelihood. Although all n subjects are incorporated in the likelihood, most of the information comes from the uncensored observations, the ones that have experienced an event. We have found that substituting d, the number of uncensored events, for n, the total number of individuals, in BIC results in an improved criterion without sacrificing the asymptotic properties shown by Kass and Wasserman (1995).

2 The Bayesian Information Criterion

Standard Bayesian testing procedures use the Bayes factor (BF), which is the ratio of integrated likelihoods for two competing models. Kass and Raftery (1995) derive BIC as an approximation to twice the difference in log integrated likelihoods, so that the difference in BIC between two models approximates twice the logarithm of the Bayes factor. Hence,

$$\frac{2\log(\mathrm{BF}) - \mathrm{BIC}}{2\log(\mathrm{BF})} \to 0. \qquad (2)$$

However,

$$2\log(\mathrm{BF}) - \mathrm{BIC} \not\to 0. \qquad (3)$$

Equation (3) implies that, for general priors on the parameters, $2\log(\mathrm{BF}) - \mathrm{BIC}$ has a nonvanishing asymptotic error of constant order, i.e. of order O(1). Since the absolute value of BIC increases with n, the error tends to zero as a proportion of BIC. Therefore BIC has the undesirable property that for any constant k, BIC + k also approximates twice the log Bayes factor to the same order of approximation as BIC itself.
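The effect of substituting d for n in the penalty term can be sketched with a small calculation. The counts and the LRT statistic below are made-up values, not results from the paper, and the helper name is hypothetical. The sketch assumes the convention that BIC is the penalty minus the LRT statistic, so that negative values favor the larger model.

```python
import math


def bic_from_lrt(lrt, extra_params, size):
    """BIC = -LRT + (m_j - m_0) * log(size); 'size' is either n or d."""
    return -lrt + extra_params * math.log(size)


# Hypothetical heavily censored study: many subjects, few observed events.
n_subjects = 5000
d_events = 200
lrt = 7.0          # hypothetical LRT statistic for one added covariate
extra_params = 1

bic_n = bic_from_lrt(lrt, extra_params, n_subjects)  # penalty log(5000)
bic_d = bic_from_lrt(lrt, extra_params, d_events)    # penalty log(200)

# With these made-up numbers the two penalties lead to opposite conclusions:
# the n-based criterion is positive, the d-based criterion is negative.
print(bic_n > 0, bic_d < 0)
```

Because log(d) is never larger than log(n), the d-based penalty is the milder of the two, and the gap widens as censoring becomes heavier.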
This O(1) error suggests that the BIC approximation is somewhat crude, and may perform poorly for small samples. Kass and Wasserman (1995) show that with nested models, under a particular prior on the parameters, the constant order asymptotic error disappears, and they argue that this prior can reasonably be used for inference purposes. Following the notation of their paper, let $Y = (y_1, \ldots, y_n)$ be iid observations from a family parameterized by $(\beta, \psi)$, with $\dim(\beta, \psi) = m$ and $\dim(\beta) = m_0$. Our goal is to test $H_0: \psi = \psi_0$ against $H_1: \psi \in$ [...]

It suffices to show that the three conditions preceding (5) hold. (C1) holds by assumption. Conditions (C2) and (C3) hold by Theorem 3.2 in Andersen and Gill (1982).

Theorem 3

Again, condition (C1) holds by assumption, and (C2) holds by Andersen and Gill (1982). To prove (5), we need to show that (6) holds. To do this, we present a slight alteration of the proof in Andersen and Gill (1982). First, some notation is needed. Let

$N_i(t) = I(T_i \le t, \delta_i = 1)$ be the counting process for individual i;
$\bar N(t) = \sum_i N_i(t)$ be the total counting process for the data;
$Y_i(t) = I(T_i \ge t)$ be the at-risk indicator for individual i at time t;
$S^{(0)}(\beta, t) = \frac{1}{n} \sum_i Y_i(t) \exp(Z_i'\beta)$ be the average risk in the risk set;
$\lambda_0(t)$ be the baseline hazard rate at time t;
and $s^{(0)}(\beta, x)$ be a bounded function such that $\sup_{x \in [0, \tau]} \| S^{(0)}(\beta, x) - s^{(0)}(\beta, x) \| \to 0$.

Define the following empirical weighted covariance matrix:

$$V(\beta, t) = \sum_{i \in R_t} w_i(R_t) \{Z_i - \mathcal{E}_{R_t}(Z)\} \{Z_i - \mathcal{E}_{R_t}(Z)\}^T,$$

where $R_t$ is the risk set at time t, $Z_i$ is the covariate vector of subject i,

$$w_i(R_t) = \frac{\exp(Z_i'\beta)}{\sum_{i \in R_t} \exp(Z_i'\beta)},$$

and $\mathcal{E}_{R_t}(Z) = \sum_{i \in R_t} w_i(R_t) Z_i$. Then there exists a function $v(\beta, x)$ such that $\sup_{x \in [0, \tau]} \| V(\beta, x) - v(\beta, x) \| \to 0$ (Fleming and Harrington 1991, p. 296).
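The risk-set weights and the empirical weighted covariance matrix defined above can be sketched for a single risk set as follows. This is an illustration under stated assumptions, not the authors' code: the function names and the toy risk set are hypothetical, and covariates are passed as plain lists, one row per subject in the risk set.

```python
import math


def risk_set_weights(covariates, beta):
    """Weights w_i(R_t) = exp(Z_i' beta) / sum over the risk set of exp(Z_k' beta)."""
    scores = [math.exp(sum(z * b for z, b in zip(row, beta))) for row in covariates]
    total = sum(scores)
    return [s / total for s in scores]


def weighted_covariance(covariates, beta):
    """Return (weights, weighted covariate mean, weighted covariance matrix V)."""
    w = risk_set_weights(covariates, beta)
    p = len(beta)
    # Weighted mean of the covariates over the risk set.
    mean = [sum(wi * row[k] for wi, row in zip(w, covariates)) for k in range(p)]
    # V[a][b] = sum_i w_i * (Z_ia - mean_a) * (Z_ib - mean_b)
    cov = [
        [
            sum(wi * (row[a] - mean[a]) * (row[b] - mean[b])
                for wi, row in zip(w, covariates))
            for b in range(p)
        ]
        for a in range(p)
    ]
    return w, mean, cov


# Toy risk set with three subjects and two covariates (made-up values).
risk_set = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
beta = [0.0, 0.0]  # at beta = 0 every subject gets the same weight
w, mean, V = weighted_covariance(risk_set, beta)
print(w, mean, V)
```

At beta = 0 the weights are uniform, so V reduces to the ordinary (population) covariance of the covariates in the risk set, which gives a quick sanity check.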
To complete the proof, note that

$$
\begin{aligned}
\left\| \frac{1}{d} I(\hat\beta, t) - \frac{1}{q} I(\beta, t) \right\|
\le{} & \left\| \frac{n}{d} \, \frac{1}{n} \int_0^t \{V(\hat\beta, x) - v(\hat\beta, x)\} \, d\bar N(x) \right\| \\
& + \left\| \frac{n}{d} \, \frac{1}{n} \int_0^t \{v(\hat\beta, x) - v(\beta, x)\} \, d\bar N(x) \right\| \\
& + \left\| \frac{n}{d} \, \frac{1}{n} \int_0^t \frac{1}{n} v(\beta_0, x) \Big\{ d\bar N(x) - \sum_i Y_i(x) \exp(Z_i'\beta_0) \lambda_0(x) \, dx \Big\} \right\| \\
& + \left\| \frac{1}{q} \int_0^t v(\beta_0, x) \{S^{(0)}(\beta_0, x) - s^{(0)}(\beta_0, x)\} \lambda_0(x) \, dx \right\|. \qquad (18)
\end{aligned}
$$

The first two terms on the right side of (18) converge to zero in probability by Lenglart's inequality (Fleming and Harrington 1991, Theorem 3.4.1). The third term converges to zero by an application of a corollary to Lenglart's inequality (Fleming and Harrington 1991, Corollary 3.4.1 and Lemma 8.2.1). The fourth term converges to zero by the definition of $s^{(0)}(\beta_0, x)$ and the fact that $\int_0^\tau \lambda_0(x) \, dx < \infty$. □

References

Andersen, P. and R. D. Gill (1982). Cox's regression model for counting processes: A large sample study. Annals of Statistics 10, 1100-1120.
Cox, D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Ser. B 34, 187-220.
Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269-276.
Dickson, E. R., T. R. Fleming, and R. H. Weisner (1985). Trial of penicillamine in advanced primary biliary cirrhosis. New England Journal of Medicine 312, 1011-1015.
Fleming, T. R. and D. P. Harrington (1991). Counting Processes and Survival Analysis. New York: Wiley.
Fried, L. P., N. O. Borhani, et al. (1991). The Cardiovascular Health Study: Design and rationale. Annals of Epidemiology 1, 263-276.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society, Ser. B 14, 107-114.
Kalbfleisch, J. D. (1978). Nonparametric Bayesian analysis of survival time data. Journal of the Royal Statistical Society, Ser. B 40, 214-221.
Kalbfleisch, J. D. and R. L. Prentice (1973). Marginal likelihoods based on Cox's regression and life model. Biometrika 60, 267-278.
Kalbfleisch, J. D. and R. L. Prentice (1980). The Statistical Analysis of Failure Time Data. Wiley.
Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773-795.
Kass, R. E. and L. Wasserman (1995). A reference Bayesian test for nested hypotheses with large samples. Journal of the American Statistical Association 90, 928-934.
Kass, R. E. and L. Wasserman (1996). Formal rules for selecting prior distributions: A review and annotated bibliography. Journal of the American Statistical Association 91, 1343-1370.
Lindley, D. V. (1957). A statistical paradox. Biometrika 44, 187-192.
Madigan, D., J. Gavrin, and A. E. Raftery (1995). Eliciting prior information to enhance the predictive performance of Bayesian graphical models. Communications in Statistics: Theory and Methods 24, 2271-2292.
Madigan, D. and A. E. Raftery (1994). Model selection and accounting for model uncertainty in graphical models using Occam's Window. Journal of the American Statistical Association 89, 1535-1546.
Manolio, T. A., R. A. Kronmal, G. L. Burke, et al. (1996). Short-term predictors of incident stroke in older adults. Stroke 27, 1479-1486.
Prentice, R. L. (1973). Exponential survivals with censoring and explanatory variables. Biometrika 60, 279-288.
Raftery, A. E. (1995). Bayesian model selection in social research (with discussion). In P. Marsden (Ed.), Sociological Methodology 1995, pp. 111-195. Cambridge, Mass: Blackwells.
Raftery, A. E., D. Madigan, and J. Hoeting (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92, 179-191.
Savage, I. R. (1957). Contributions to the theory of rank order statistics. The "trend" case. Annals of Mathematical Statistics 28, 968-977.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-464.
Spiegelhalter, D. J. and A. F. M. Smith (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Ser. B 44, 377-387.
Volinsky, C. T., D. Madigan, A. E. Raftery, and R. A. Kronmal (1997). Bayesian Model Averaging in proportional hazard models: Assessing the risk of a stroke. Applied Statistics 46 (3), 443-448.

Similar resources

A unified framework for fitting Bayesian semiparametric models to arbitrarily censored survival data, including spatially-referenced data

A comprehensive, unified approach to modeling arbitrarily censored spatial survival data is presented for the three most commonly-used semiparametric models: proportional hazards, proportional odds, and accelerated failure time. Unlike many other approaches, all manner of censored survival times are simultaneously accommodated including uncensored, interval censored, current-status, left and ri...


A Bayesian neural network approach for modelling censored data with an application to prognosis after surgery for breast cancer

A Bayesian framework is introduced to carry out Automatic Relevance Determination (ARD) in feedforward neural networks to model censored data. A procedure to identify and interpret the prognostic group allocation is also described. These methodologies are applied to 1616 records routinely collected at Christie Hospital, in a monthly cohort study with 5-year foll...


Model Selection Based on Tracking Interval Under Unified Hybrid Censored Samples

The aim of statistical modeling is to identify the model that most closely approximates the underlying process. Akaike information criterion (AIC) is commonly used for model selection but the precise value of AIC has no direct interpretation. In this paper we use a normalization of a difference of Akaike criteria in comparing between the two rival models under unified hybrid cens...


Bayesian information criterion for censored survival models.

We investigate the Bayesian Information Criterion (BIC) for variable selection in models for censored survival data. Kass and Wasserman (1995, Journal of the American Statistical Association 90, 928-934) showed that BIC provides a close approximation to the Bayes factor when a unit-information prior on the parameter space is used. We propose a revision of the penalty term in BIC so that it is d...


High-Dimensional Cox Regression Analysis in Genetic Studies with Censored Survival Outcomes

With the advancement of high-throughput technologies, nowadays high-dimensional genomic and proteomic data are easy to obtain and have become ever increasingly important in unveiling the complex etiology of many diseases. While relating a large number of factors to a survival outcome through the Cox relative risk model, various techniques have been proposed in the literature. We review some rec...


A SAS Procedure Based on Mixture Models for Estimating Developmental Trajectories

This article introduces a new SAS procedure written by the authors that analyzes longitudinal data (developmental trajectories) by fitting a mixture model. The TRAJ procedure fits semiparametric (discrete) mixtures of censored normal, Poisson, zero-inflated Poisson, and Bernoulli distributions to longitudinal data. Applications to psychometric scale data, offense counts, and a dichotomous preva...



Publication date: 1998